The University of Alberta 
Printing Department 
Edmonton, Alberta 


Digitized by the Internet Archive 
in 2023 with funding from 
University of Alberta Library 


https://archive.org/details/Whitehurst1974




RELEASE FORM 


NAME OF AUTHOR : Anthony R. Whitehurst 
TITLE OF THESIS : The Perceptual Role of Voice Onset Time 
DEGREE FOR WHICH THESIS WAS PRESENTED : Master of Science 


YEAR THIS DEGREE GRANTED : 1974 


Permission is hereby granted to THE UNIVERSITY OF
ALBERTA LIBRARY to reproduce single copies of this
thesis and to lend or sell such copies for private,
scholarly or scientific research purposes only.

The author reserves other publication rights,
and neither the thesis nor extensive extracts from it
may be printed or otherwise reproduced without the
author's written permission.




THE UNIVERSITY OF ALBERTA 


THE PERCEPTUAL ROLE OF VOICE ONSET TIME


by 


(C) Anthony R. Whitehurst 


A THESIS 
SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH 
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE 


OF MASTER OF SCIENCE 


IN 


EXPERIMENTAL PHONETICS 
DEPARTMENT OF LINGUISTICS 


EDMONTON, ALBERTA 


SPRING, 1974 




THE UNIVERSITY OF ALBERTA 


FACULTY OF GRADUATE STUDIES AND RESEARCH 


The undersigned certify that they have read, and 
recommend to the Faculty of Graduate Studies and Research, 
for acceptance, a thesis entitled "The Perceptual Role of 
Voice Onset Time", submitted by Anthony R. Whitehurst in 
partial fulfilment of the requirements for the degree of 


Master of Science. 




ABSTRACT 


The perceptual impact of Voice Onset Time (VOT) was
examined in an experimental design including three factors:
place of articulation, manner of articulation, and white
noise masking. The perceived voiced/voiceless distinction
was upheld throughout the entire range of masking conditions,
while a significant place by masking interaction was
observed.


High levels of white noise masking were also accompanied
by the confusion of place of articulation. Such
confusions gave rise to a re-evaluation of the present
synthetic stimuli in acoustic (pre-linguistic) terms.
Variations in a number of stimulus features in the frequency
domain were observed to correspond closely to changes in
VOT. As a result of this observation, in conjunction with
earlier studies of both speech and non-speech auditory
analysis, it was suggested that at least some of those
associated stimulus features may be analyzed by the listener
in a non-linguistic mode.
CHAPTER I


INTRODUCTION 


Preliminary Considerations 


The primary goal of this study was to examine in some
detail certain aspects of the perceived 'voiced-voiceless'
manner distinction in initial stop consonants. Although
some consideration was given to several studies which were
believed to be related to the issue on a general level,
most serious attention was given to those which dealt more
specifically with the notion of voice onset time (VOT), the
time interval between consonant release and the onset of
glottal pulsing. VOT is a recent addition to the list of
features which are assumed to be in some way influential
in the perceived difference between English 'voiced' (b-d-g)
and 'voiceless' (p-t-k) categories in initial position.


The status of VOT has rapidly evolved, from that of
a descriptive parameter (Lisker & Abramson, 1964) to being
indicative of a biologically significant aspect of human
neural structure (Eimas & Corbit, 1973). In accordance
with the logical and chronological development of the
status of VOT, several major assertions have been offered:


Analytical claims 


1a. For all languages which exhibit at least two
perceived manner categories of initial stop
consonants, the perceptual manner categories
can be differentiated along the single dimension
of VOT. In this sense, VOT has been regarded
as a universal phonetic dimension.

1b. Other proposed features are to be regarded
simply as consequences of VOT, and therefore
need not be invoked in the categorization of
initial stop consonants. Thus VOT is a sufficient
perceptual cue for the observed
distinctions.

Biological claims


2a. All speakers must share some common means by
which this universal dimension may be subdivided
into perceptual classes.

2b. Experimental evidence interpreted as supportive
of linguistic VOT distinctions in pre-verbal
infants led to the hypothesis that the mechanism
for VOT discrimination is an innate human
neural complex, i.e., a specific 'linguistic'
feature detector mechanism.


Examination of the few critical reports available
has led to the conclusion that the above assertions may be
premature. Furthermore, it has become clear that VOT,
evidently treated as the sole necessary and sufficient cue
in the perceived voiced-voiceless manner distinction, has been
elevated to its present level of importance without having
been observed under a sufficiently broad range of strictly
defined experimental conditions.

Background of the Problem

The "Voiced/Voiceless" Distinction


English stop consonants are traditionally classed
as either "voiced" (b-d-g) or "voiceless" (p-t-k). This
descriptive classification was originally based on the
presence or absence of low-amplitude glottal pulses during
the interval of oral occlusion, just prior to the consonant
release. Although, in initial position, the presence of
the glottal "buzz" (or "voice bar", as it has been more
recently termed in spectrographic analysis) would clearly
indicate a voiced stop, it has been noted that this feature
is not necessarily present in the case of certain English
initial stop consonants which may nevertheless be perceived
as voiced.


Because of such cases, it became necessary to conscript
the use of another (previously secondary) feature,
aspiration, to distinguish between the English voiced and
voiceless stop categories in initial position. Aspiration
can be seen in a spectrographic display as a noise component,
appearing after the consonant release, whose frequency
distribution is generally spread over the range of the
second and third vowel formants. The presence of such a
noise component was cited as indication of a voiceless stop
consonant. As a result, the added consideration of either
voicing or aspiration was found to yield more consistent
distinctions between the two manner categories. However,
it became evident that for various languages, including
English, the occurrence of these two key features was somewhat
unpredictable, in that on the one hand, initial stops
perceived as voiced might or might not exhibit a distinct
voice bar, while on the other hand, those perceived as
voiceless might or might not exhibit a distinct amount of
aspiration in final position, or before unstressed vowels.


A third distinction between the stop consonant
categories was based on acoustic qualities noted by
Fletcher (1929). It was introduced as the 'tense/lax', or
'fortis/lenis' distinction. Fletcher discussed the two
phonemic categories in terms of their relative amplitudes,
and noted that the voiceless (tense) category (p-t-k) showed
consistently higher audibility than the corresponding
voiced (lax) category (b-d-g). In the course of his
experimentation, he noted that the number of decibels of
difference required for complete attenuation of a given
fortis phoneme was consistently higher than that required
for its lenis counterpart (Fletcher, 1929).


Jakobson, Fant, and Halle (1952) mentioned an
apparent difference in the production of tense and lax
consonants: "Tense consonants are articulated with
greater distinctness and pressure than the corresponding
lax phonemes (p. 38)." They proposed that previous distinctions
could be viewed as largely redundant in light of
this contrast. Thus, the tense/lax distinction alone was
employed by Jakobson and his associates, in their phonological
system, to separate the two groups of English
phonemes. The proposed contrast was included in their set
of "twelve binary oppositions" which they described as
". . . the inherent distinctive features which we detect in
the languages of the world and which underlie their entire
lexical and morphological stock . . . (p. 40)."


The nature of the tense/lax feature, however, must
be questioned when referred to as dichotomous. This is
especially true in light of the obviously continuous scales
offered for the determination of tenseness/laxness, such as
distinctness, pressure of articulation, muscular strain,
tension, deformation of the vocal tract from neutral
position, and segment duration. It should be mentioned
that although this particular distinction bears a certain
amount of interest in the present study, many have taken
exception to the Jakobsonian phonological system, often
solely on the basis of the highly arbitrary nature of
features like tenseness/laxness, and on the grounds that
such features may not truly be binary (see Fant, 1967).


Malecot (1970) argued in support of the tense/lax 


opposition, per se. Dealing strictly with physiological 




evidence, he concluded that the opposition, based on a
scale of force of articulation, was apparently a
linguistic reality but is primarily a synesthetic response
to intrabuccal air pressure impulse, with closure duration
perhaps "playing a secondary role (p. 1591)."


If this distinction, still common in phonemic descriptions,
were, in fact, independently capable of
accounting for all English (p-t-k) : (b-d-g) contrasts,
then reference to voicing (voice bar) or aspiration in the
differentiation would be rendered clearly redundant. But,
as noted by Lisker and Abramson (1964), ". . . it is too
often the case to be accidental that voiceless and aspirated
stops are discovered to be fortis, while voiced and unaspirated
ones are at the same time lenis (p. 386)."
Ultimately, it would seem only reasonable to ask what
dimensions are actually in question in the perceived "voiced-
voiceless" opposition, and further, what boundaries on those
dimensions delimit the two perceived manner classes. With
respect to the tense/lax opposition, neither of these
questions has been unambiguously answered, leading one to
conclude that the tense/lax distinction, so often co-operative
with voicing and aspiration, lacks physical correlates
which are truly independent of the other two features.


Questions regarding the isolation of relevant parameters,
and notions of perceptual boundaries, might best
be dealt with in terms of the so-called "categorical speech
perception" hypothesis, as set forth by Liberman et al.
(1957, 1958). Their notion was based on the premise that
speech stimuli drawn from a physical continuum are perceived
as members of discrete categories. Various subsequent
experiments, conducted mainly at Haskins Laboratory, were
aimed at the discovery of the relevant physical (physiological
and acoustic) parameters upon which the language-
user's perceptual processes rely.

The Concept of VOT


Lisker and Abramson (1964) began a series of studies
in an attempt to find the dimension upon which the perceived
voiced/voiceless distinction might be conclusively dependent.
They proposed that Voice Onset Time (VOT) was the primary
feature in the observed distinction, and defined VOT as the
". . . duration of the time interval by which the onset of
periodic pulsing either precedes or follows release (p. 387)."
It was held by Lisker and Abramson that this fundamental
timing relationship was the result of articulatory and
glottal adjustments which were also responsible for the
predictable co-occurrences of aspiration, voicing, and
articulatory force (tenseness/laxness).
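The timing relationship defined above may be illustrated by a minimal sketch. It is assumed here, purely for illustration, that the instants of consonant release and of the onset of periodic pulsing have already been measured in milliseconds from a common reference point; the function name and the sample values are hypothetical and are not drawn from Lisker and Abramson.

```python
def voice_onset_time(release_ms, pulsing_onset_ms):
    """Signed VOT in ms: onset of periodic pulsing relative to release.

    Negative values: pulsing precedes release (voicing lead).
    Positive values: pulsing follows release (voicing lag).
    """
    return pulsing_onset_ms - release_ms


# Hypothetical measurements (ms from the start of a recording).
lead = voice_onset_time(release_ms=100.0, pulsing_onset_ms=40.0)
lag = voice_onset_time(release_ms=100.0, pulsing_onset_ms=125.0)
print(lead, lag)
```

On this convention a single signed number places every token on one continuum, which is what allows production values from different manner categories to be compared as distributions along the VOT dimension.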


Implicit in Lisker and Abramson's position was the
assumption that VOT was a primary unit of production, and
that other features were simply its consequences. When
stated in such terms, their interpretation of VOT could be
easily introduced into the Motor Theory of Speech Perception
as a ". . . possible link between perception and articulation
(p. 420)."


Evidence Proposed for the Universality of VOT 


In a study which involved eleven languages, 
Lisker and Abramson (1964) elicited identifications of
stimuli drawn from natural speech. The languages observed 
were chosen from those which distinguished among two, three, 
or four consonant manner categories in initial position, 
before a vowel (Table 1). The fundamental issue in 
Lisker and Abramson's experiment was whether VOT alone could 
be used to determine phoneme class membership in any of the 
two, three, or four category languages. Their results would 


seem to indicate that this may not be the case. 


The perceptual phoneme categories of the six two-
category languages were sufficiently distinguishable from
one another on the basis of the features of voicing and
aspiration, as shown in Table 1. Furthermore, the phonemes
of the first two of the three-category languages, Eastern
Armenian and Thai, were equally differentiated. However,
according to Lisker and Abramson, a third feature, length,
was required to differentiate between Korean's two classes
of weakly and strongly aspirated voiceless initial stops.
Finally, the initial stop consonants of both Hindi and
Marathi, the four-category languages, were also classified
satisfactorily on the basis of only the two features of
voicing and aspiration.
TABLE 1

OCCURRENCE OF INITIAL STOP CONSONANTS
IN TWO-, THREE-, AND FOUR-CATEGORY LANGUAGES

Voiced?
Aspirated?

Two-Category Languages

English
Cantonese
Tamil
Hungarian
Spanish
Dutch

Three-Category Languages

E. Armenian    x     x     x
Thai           x*    x     x
Korean         x     x     **

Four-Category Languages

Hindi          x     x     x     x
Marathi        x     x     x     x

*Dutch and Thai do not exhibit an initial voiced velar.

**Korean bears a distinction between weakly- and strongly-
aspirated voiceless initial stops.

VOT production values for initial stop consonants
in all of the two-category languages, plus Eastern Armenian
and Thai, exhibited relatively distinct category divisions
along the VOT continuum. Korean, however, presented a
different case. The distribution of VOT's for the Korean
initial unaspirated and "weakly-aspirated" stops showed a
significant degree of VOT overlap, while the distribution
for "strongly-aspirated" stops clearly stood alone. Lisker
and Abramson defended VOT, stating that ". . . while the
distribution of values is thus somewhat anomalous, we
cannot say with reasonable assurance that our measure of
voice onset time fails to separate the three categories of
Korean stops; it will certainly suffice to distinguish the
aspirated set from the other two and it may still well be
the single most important measure for separating the latter
(1964, p. 403)."


Kim (1970) proposed, on the basis of cineradiographic
evidence, that aspiration is an independent factor in the
distinction of Korean initial stop categories. Kim found
aspiration to be highly predictable, based on the size of
the glottal opening at consonant release. Like Lisker and
Abramson in their approach to the articulatory foundations of VOT,
Kim explained aspiration as a laryngeally controlled
phenomenon. By showing a high correlation between size of
glottal opening during consonant articulation and aspiration
in the three Korean initial manner categories, Kim avoided
the difficulty encountered by the Lisker and Abramson timing
relationship. In discussing their results, Lisker and
Abramson (1964) did not attach any significance to the
possibility that, in Korean, aspiration (or at least
Kim's type of interpretation of it) might have been a more
reliable index than VOT in distinctions among initial
Korean voiceless stops.


A second observation in exception to Lisker and
Abramson's hypothesis is not unlike the foregoing. Both of
the four-category languages, Hindi and Marathi, presented
overlapping VOT distributions in the production of aspirated
and unaspirated voiced stops, while the voiceless unaspirated
and voiceless aspirated stops composed the two remaining
modes in the similar trimodal VOT distributions. As in the
case of Korean, VOT may have been an insufficient measure
in phoneme manner categorization.


With respect to the possibility of alternative
explanations, their interpretation of results appears to be
somewhat restricted: "To be sure, the voiced unaspirated
and voiced aspirated stops show differences in average
values that are almost systematic; nevertheless, they occupy
ranges that are nearly co-extensive (p. 403)." The explanation
offered in the case of Hindi and Marathi was as follows:
"It seems very likely that the voiced aspirates are distinguished
from the other voiced category by the presence
of low amplitude buzz mixed with noise in the interval
following release of the stop (p. 403)." Given that both
categories fall on the negative side of the VOT continuum
(voicing precedes consonant release), it is difficult to
justify such an explanation. If "low amplitude buzz" is to
be equated, in this context, with the glottal waveform, then
the distinction should have been reduced to one of aspiration.
In any case, the ambiguity of this explanation is
sufficient to suggest that, as in the case of Korean,
additional measures, besides VOT, might have been more
reliably enlisted in the observed distinction.


Additional counter-evidence has recently been presented
by Caramazza, Yeni-Komshian, Zurif, and Carbone
(1973). They examined VOT with respect to both the perception
and the production of initial stop consonants by
Canadian French, English, and bilingual French-English
speakers. In both production and perception, the Canadian
French speakers showed substantial VOT overlap for three
classes (labial, apical, and velar) of initial stop consonants.
Caramazza et al. concluded that these results ". . .
strongly suggest that VOT is not a sufficient cue for the
perception of voicing distinctions in Canadian French
(p. 426)."


These authors also believed that their results
warranted consideration with respect to the proposed universality
of VOT. They reasoned that, in the case of Canadian
French, their data ". . . cast doubt on any theory assigning
VOT a universal status in the total determination of the




phonetic dimensions of voicing, aspiration, and articulatory
force (p. 426)", and offered the possibility that alternatives
such as articulatory force or rate of formant
transition may be more important in the Canadian French
perceptual classification.


In light of the Korean, Hindi, and Marathi counter- 
evidence, Lisker and Abramson could not justify any general 
statement concerning the proposed universality of voice 
onset time as a perceptual cue. Nevertheless, the authors


offered the following conclusion: 


. . . this measure of voice onset time has
been applied to word-initial stops in eleven
languages and has been found to be highly
effective as a means of separating phonemic
categories, although these languages differ
both in the number of those categories and
in the phonetic features usually ascribed
to them . . . It would seem that such features
as voicing, aspiration and force of articulation
are predictable consequences of differences
in the relative timing of events at the glottis
and at the place of oral occlusion (1964, p. 422).


In addition, several authors (Eimas et al., 1971;
Eimas and Corbit, 1973) maintained the universality of the
VOT dimension, and continued to expand its range of
applications.
Biological claims 


In the first of two closely related studies, Eimas
et al. (1971) were interested in possible biological
foundations for the categorical perception of VOT. By
conditioning the non-nutritive sucking activity of one-
and four-month-old infants exposed to stimuli varying
in VOT, it was hoped that the results would indicate first,
a differential response to pairs of stimuli drawn from
different adult English phoneme categories, and second, no
differentiation between stimuli drawn from a single category.
Stimuli were chosen from the set of synthetic CV syllables
prepared by Lisker and Abramson for their own VOT experiments
conducted earlier at Haskins Laboratory. When a
subject had been habituated to a stimulus, a second stimulus
was presented, and any changes in the rate of conditioned
response (non-nutritive sucking) were measured.


The authors reported a significant increment in
response rate when stimuli were drawn from different standard
English perceptual manner categories, and no change
when stimuli were drawn from a single manner category.
They inferred from these results that ". . . the means by
which the categorical perception of speech, that is,
perception in a linguistic mode, is accomplished may well
be part of the biological make-up of the organism (p. 306)."
Offering equivocal conclusions, based on rather insufficiently
defined evidence, this study was used as the basis
for further extensions of the concept of VOT.


The Eimas and Corbit Model 


Eimas and Corbit (1973) extended the notions of the
assumed biological significance of VOT. Summarizing the
earlier works regarding VOT, such as Lisker and Abramson
(1964) and Abramson and Lisker (1970), they presented the
following characterization: "Of particular interest is
the fact that the categorical nature of the perception of
the VOT continuum appears to be universal (p. 101)."
Moreover, Eimas et al. (1971) was cited
as evidence of the pre-verbal occurrence of VOT discrimination,
leading to the conclusion that "The apparent
universality of this phenomenon suggests that it is a
manifestation of the basic structure of the human brain
(p. 101)." To propose that the foregoing assumptions are
questionable on fundamental grounds should require no
further justification than the equivocal results of those
studies cited as supportive by Eimas and Corbit, since
their subsequent experimental hypothesis was contingent
upon the previously discussed analytical and biological
claims.


Eimas and Corbit proposed that the categorical perception
of VOT was mediated by a "selectively tuned
linguistic feature detector" system. The system, they held,
is composed of independent 'detectors', one of which is
specifically tuned to a limited range of VOT values corresponding
to voiced English initial stop consonants, and
the other to the range of VOT values corresponding to
voiceless initial stop consonants. Excitation of the
former feature detector (FD1) would induce the perception
of a voiced consonant, while excitation of the latter
detector (FD2) would induce the perception of a voiceless
consonant.


Several important assumptions were required if
Eimas and Corbit's model were to account for previous data
from categorical perception experiments. First, the detectors
had to be differentially sensitive, each to its own
limited range of VOT values. Eimas and Corbit suggested
that the sensitivity of a given detector ". . . might be
measured, in principle, by the output signal of the detector
(p. 108)." Next, since stimuli could presumably fall
between the modal detection values of the two detectors,
thereby exciting both simultaneously, it would have to be
assumed that only the detector whose output signal was
stronger would be recognized at higher levels of analysis
in the auditory pathway. If, however, a stimulus were to
excite both detectors equally, it was said to fall at the
phonetic boundary (B).


Certain inferences can be drawn from the foregoing 
assumptions. For example, one might expect that the "-in 
principle-" measurement of the sensitivity of the feature 
detectors would provide (as projected in Figure 1) a graph 
of output signal strength, elicited by subjecting the
detectors to a wide range of stimuli varying along the VOT 
continuum. The figure was constructed to agree with the 
properties discussed by Eimas and Corbit (p. 108). However, 


no such measurements were made by Eimas and Corbit. 
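The assumptions just listed may be made concrete in a minimal sketch. Since Eimas and Corbit reported no tuning curves, everything numerical here is an assumption of the present illustration: the bell-shaped (Gaussian) form of each detector's output, the modal values of 0 and 60 msec, and the common width of 25 msec are all hypothetical. The sketch shows only how a phonetic boundary (B) follows from two overlapping detectors when the stronger output determines the percept.

```python
import math

# Hypothetical tuning parameters in ms; not taken from Eimas and Corbit.
FD1_CENTER = 0.0    # assumed modal value of the "voiced" detector
FD2_CENTER = 60.0   # assumed modal value of the "voiceless" detector
WIDTH = 25.0        # assumed common spread of both tuning curves


def detector_output(vot_ms, center_ms, width_ms=WIDTH, gain=1.0):
    """Output signal strength of one detector for a stimulus at vot_ms."""
    return gain * math.exp(-0.5 * ((vot_ms - center_ms) / width_ms) ** 2)


def percept(vot_ms):
    """Only the detector with the stronger output signal is recognized."""
    out1 = detector_output(vot_ms, FD1_CENTER)
    out2 = detector_output(vot_ms, FD2_CENTER)
    return "voiced" if out1 > out2 else "voiceless"


def phonetic_boundary(lo=FD1_CENTER, hi=FD2_CENTER, tol=1e-6):
    """Bisect for the VOT at which both detectors respond equally."""
    def diff(v):
        return detector_output(v, FD1_CENTER) - detector_output(v, FD2_CENTER)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(mid) > 0:   # FD1 still stronger: boundary lies above mid
            lo = mid
        else:               # FD2 at least as strong: boundary at or below mid
            hi = mid
    return (lo + hi) / 2.0


print(percept(10.0), percept(50.0), round(phonetic_boundary(), 3))
```

With equal gains and widths the boundary falls midway between the two assumed modal values; any other choice of tuning curve would place it elsewhere, which is why the absence of measured curves matters.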
Instead of pursuing the hypothesis that the feature
detectors exist in the form described, Eimas and Corbit
chose to take a step further and attempt to test the effects
of adaptation on that mechanism. They postulated that
repeated stimulation of either detector would cause fatigue,
thereby lessening its sensitivity. They further assumed,
"for purposes of simplicity (p. 108)," that the output
signal strength of a fatigued detector would decrease
equally for the entire range of VOT values to which it is
normally sensitive. Figure 2 represents a projected graph
of signal output strength after adaptation of FD2. Note
that Figure 2, like Figure 1, is qualified as this author's
interpretation of the discussion by Eimas and Corbit
(p. 108). They presented no such examples.


The effect that Eimas and Corbit hoped to observe 
in support of their proposed feature detector system was a 
shift in the phonetic boundary toward the adapted stimulus. 
Adaptation to a long VOT should have caused an upward shift 
in the phonetic boundary (as in Figure 2), while adaptation 
to a short VOT should have caused a downward shift in the 


phonetic boundary. 
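The predicted direction of the shift can be checked with a short worked sketch. It rests on assumptions that are not in Eimas and Corbit: each detector is given a Gaussian tuning curve of equal width, and fatigue is represented as a uniform scaling of a detector's output by a gain below 1 (one possible reading of an output that decreases "equally" over the whole range). Under those assumptions, setting the two outputs equal and solving for VOT gives the boundary in closed form, B = (c1 + c2)/2 + w^2 ln(g1/g2)/(c2 - c1), where c1 and c2 are the hypothetical modal values, w the common width, and g1 and g2 the gains.

```python
import math


def boundary_after_adaptation(gain_fd1=1.0, gain_fd2=1.0,
                              center_fd1=0.0, center_fd2=60.0, width=25.0):
    """VOT (ms) at which two Gaussian detectors give equal output.

    All parameter values are hypothetical. A gain below 1.0 stands for
    a fatigued detector whose output is scaled down uniformly.
    """
    midpoint = (center_fd1 + center_fd2) / 2.0
    shift = width ** 2 * math.log(gain_fd1 / gain_fd2) / (center_fd2 - center_fd1)
    return midpoint + shift


unadapted = boundary_after_adaptation()
long_vot_adapted = boundary_after_adaptation(gain_fd2=0.5)   # FD2 fatigued
short_vot_adapted = boundary_after_adaptation(gain_fd1=0.5)  # FD1 fatigued
print(unadapted, long_vot_adapted, short_vot_adapted)
```

Fatiguing the long-VOT detector raises the boundary and fatiguing the short-VOT detector lowers it, in agreement with the direction of shift described above. The size of the shift depends entirely on the assumed width and gains.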


Here it is important to note certain shortcomings
of Eimas and Corbit's model. First, given a binary response
regardless of place of articulation, that is, a one-way
classification of stimuli as either voiced or voiceless, a
single "feature detector" would suffice. This is only one
area in which alternate models may have been considered.
Second, in order to ensure accurate measurement of the
phonetic boundary in the present model, it was necessary
that the two feature detectors be 'specifically tuned' to
overlapping ranges of VOT values. Although in this respect
the efficiency of the proposed detectors falls somewhat
short of ideal, it was on this basis alone that the authors
could postulate changes in the phonetic boundary as a result
of adaptation procedures. In other words, had they
assumed that the detectors were, in fact, sensitive to
restricted, independent ranges of VOT values, then the adaptation
could not have been predicted.
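The second point may be illustrated by a small sketch of the alternative case. It assumes, hypothetically, two detectors with flat responses over restricted, non-overlapping VOT ranges (the range limits below are arbitrary) and represents fatigue as a uniform reduction of one detector's output. Because no stimulus ever excites both detectors, weakening one of them cannot change which detector responds to a given stimulus.

```python
def boxcar_output(vot_ms, lo_ms, hi_ms, gain=1.0):
    """Flat response inside the detector's range, none outside it."""
    return gain if lo_ms <= vot_ms <= hi_ms else 0.0


def label(vot_ms, gain_voiced=1.0, gain_voiceless=1.0):
    """Percept under hypothetical non-overlapping ranges (ms)."""
    out_voiced = boxcar_output(vot_ms, -20.0, 20.0, gain_voiced)
    out_voiceless = boxcar_output(vot_ms, 40.0, 80.0, gain_voiceless)
    if out_voiced == out_voiceless:
        # Ranges never overlap, so equality means neither detector fired.
        return None
    return "voiced" if out_voiced > out_voiceless else "voiceless"


series = range(-10, 81, 10)  # hypothetical VOT series in ms
before = [label(v) for v in series]
after = [label(v, gain_voiceless=0.5) for v in series]  # long-VOT detector fatigued
print(before == after)
```

The labels are identical before and after fatigue, so no boundary shift is predicted. Stimuli falling between the two ranges receive no label at all in this sketch, which is a further reason the overlapping form was needed for the model.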


Eimas and Corbit elicited identification functions
for labial and dental stops (b-p, d-t), both before and
after adaptation. Their experimental design employed a
two-alternative forced choice (2AFC) paradigm, not unlike
the forced choice methods used in earlier VOT studies, in
which, for any given condition, both stimuli and responses
were limited to a single place of articulation, and stimuli
were to be identified as either voiced or voiceless. Using
the Lisker and Abramson (1970) synthetic stimuli, initial
identification functions were elicited for the labial and
dental series, separately. Subjects were then asked to
identify the same stimuli, after having been "adapted" to
one or the other extreme of the VOT series for either the
labial or dental stimuli. The obtained identification
functions indicated a shift toward the adapting stimulus.
This shift was interpreted as a shift in the phonetic
boundary, leading Eimas and Corbit to conclude that their
model was, in fact, supported.


Eimas and Corbit attempted to rule out the possibility
that any simple acoustic (as opposed to linguistic)
variables might be held to account for the observed
phenomenon. They showed that adaptation to a dental stop
produced a shift in the identification function for labial
stops, or conversely, adaptation to a labial stop produced a
similar shift in the functions for dental stops. On the
basis of this cross-series effect, they proposed that the
detector system was not sensitive to "simple acoustic
information", but only to "those complex aspects of the
sound pattern that both series had in common, namely, voice
onset time (p. 105)."


This assumption, however, may not be warranted, as 
evidenced by the fact that the proposed "phonetic boundary" 
for dental stimuli was several milliseconds greater than 
that for the labial stimuli (cf. p. 105), suggesting that
VOT was dealt with in a somewhat different fashion for the
two places of articulation. In order to deal with this
particular phenomenon (to which Eimas and Corbit attached
no apparent significance), and yet maintain the original
notion of fixed VOT-detectors, it would appear necessary
to postulate either a separate set of VOT-detectors for
each place of articulation, or an independent mechanism,
capable of translating VOT, in different contexts or
places of articulation, into some unambiguous form amenable
to the feature detector mechanism. Furthermore, it is not
at all clear that the only feature shared by both series
was VOT.


One might conclude that the mechanism described by 
Eimas and Corbit might have been more reasonably characterized 
as a highly complex mechanism, composed of numerous "feature 
detectors", all of which contribute to the gross function
of the mechanism. For example, consider any complex 
mechanism, composed of smaller, functional units. The loss 
or alteration of the function of any of the components 
through adaptation procedures will, to a certain degree, 
affect the output function of the mechanism. Thus, experi- 
mental alterations of gross function (much like the results 
cited by Eimas and Corbit) would not be a clear indication 
of the specific properties of the mechanism, unless those 
lower-level functions had been anticipated. In light of 
possible alternative explanations, Eimas and Corbit's results 
are held to offer no unequivocal support of their particular
model.

Statement of the Problem


The issue of greatest importance here is not simply 
whether the hypotheses put forward by Eimas and Corbit (1973) 
were supported. The main question to be offered is whether
VOT alone, as originally defined by Lisker and Abramson (1964),
is truly a sufficient criterion for the assignment of
phonemic status to initial stop consonants.


If one were to regard certain established 'cues',
such as aspiration and force of articulation, as analytical
consequences of articulatory and glottal gestures, then
there would be no argument against the analytical necessity
of VOT, since Lisker and Abramson have also discussed VOT
in terms of essentially the same relative articulatory and
glottal adjustments. On the other hand, to say that VOT
alone is sufficient for the perceived distinction is to
ignore other features which have been experimentally
established as influential in the perceived voiced/voiceless
distinction.


It would appear that a study of the effects of 
various treatments on the perceptual impact of VOT is in 
order. The acceptability of VOT as a universal perceptual 
cue should be contingent upon the replicability of earlier 
results under a broader range of well-defined experimental 
conditions. Furthermore, a critical re-evaluation of key 
assumptions underlying the concept of VOT and its acoustic 
consequences may provide a framework from which more reasonable
accounts of observed phenomena may emerge.


The present study was designed to examine the nature 
of the effects of place of articulation, manner of articula- 
tion, and white noise masking on the distribution of stimulus 


identifications as measured on the VOT continuum. Using
the median VOT value of a given response distribution as a
measure of performance, the various factors were introduced
on the basis of several considerations.


First, since VOT is by definition totally independent 
of place of articulation, median VOT values for one place 
of articulation should not differ significantly from those
of another place of articulation. By limiting stimuli and
responses to a single place of articulation, the two alternative
forced choice (2AFC) paradigm, as used in earlier
VOT studies, could not have provided any avenue for the
direct comparison of different places of articulation. The
present study, however, employed a four alternative (4AFC)
paradigm in which mixed labial and dental stimuli were not
only to be identified as implicitly voiced or voiceless,
but were also to be classified with respect to place of
articulation.


Second, manner of articulation (voiced or voiceless) 
may, with certain qualifications, be regarded as the only
factor in the present experiment in any way associated with
the perceptual distinction between voiced and voiceless
initial stops in English. If subjects are indeed capable
of assigning the phonemic status of voiced or voiceless to
stimuli which vary in VOT, then according to previous results,
the distributions of voiced and voiceless responses
should occupy relatively discrete ranges of VOT. That is,
'voiced' responses should be concentrated at the low end of
the VOT scale, while 'voiceless' responses should be concentrated
at the high end. Under such circumstances, it
follows that the median VOT values of those distributions
should be consistently different from one another, as long
as subjects maintain a discrete mode of response.


Third, white noise masking was chosen as a factor 
on the basis of its possible implications in two areas.
First, if the temporal measure of VOT is, in fact, the sole
perceptual cue in the voiced/voiceless distinction, then
the masking of frequency information during the VOT interval
should have no effect on observed patterns of stimulus
identification. Moreover, with respect to the "linguistic
feature detector" mechanism proposed by Eimas and Corbit
(1973), additional motivation is provided by considerations
in the domain of signal detection.


The separate effects of adaptation and of noise on 
a multiple detector mechanism are fundamentally different. 
Adaptation is defined as a decrease in the output of a 
single receptor (in this case, a 'tuned detector') in response
to a constant stimulus. The addition of random noise
to a signal should not have this effect. The effect of
random noise on a receptor system has been described by
Green and Swets (1966) in terms of 'interference', or uncertainty
in sensory measurements. For fixed noise background
levels, the amount of uncertainty is evenly distributed over
the entire range of sensation of the particular modality.
Thus, random noise masking should not have a selective
effect on a single receptor or output distribution, as
would adaptation.


Finally, with respect to the notion of "fixed"
neural detectors, subjects should not be expected to differ
significantly in patterns of performance. One important
aspect of the present study is the consideration of the
possibility of differences among subjects.

CHAPTER II
METHOD

Stimuli


In nearly all respects, the synthetic CV syllables
used as stimuli in the present experiment were identical
to those used by Eimas and Corbit (1973). In order to
achieve some understanding of the types of differences
between the two sets of stimuli, it may be of assistance to
first present a brief description of the latter.


The stimuli used by Eimas and Corbit were those 
generated by Lisker and Abramson in 1967 at the Haskins 
Laboratories for use in their early cross-language tests. 
They used a parallel resonance synthesizer, equipped with 
" . . . three formant resonators with variable frequencies 
and amplitudes, a choice of buzz or hiss excitation, or a
mixture of the two, and control of the overall amplitude
and fundamental frequency (Lisker & Abramson, 1967, pp. 563-564)."
All of the synthesized initial consonants were
followed by a three-formant approximation to the vowel [a],
and the labial, apical, and velar place categories were
achieved by the addition of appropriate release bursts and
formant transitions at the beginning of each syllable. Voicing
before consonant release (voicing lead) was represented
by the insertion of low frequency harmonics of the buzz
source. Voicing lag, voice onset after consonant release,
was achieved by suppression of the onset of the first formant,
relative to the second and third formants, while filling
the interval between consonant release and voice onset with
hiss excitation of the second and third formants to approximate
aspiration. Of the broad range of 37 VOT variants in
each place series (150 msec before release to 150 msec after
release) synthesized by Lisker and Abramson, Eimas and
Corbit used only 14 from the labial series (-10 msec to +60
msec) and 14 from the dental series (0 msec to +80 msec).


The present experiment employed a Parametric 
Artificial Talker (PAT) in the synthesis of the required 
stimuli (see Anthony & Lawrence, 1962, for original concepts 
in the development of the PAT). The PAT was driven and controlled
by a PDP-12 digital computer. Included in PAT's
eight control functions are: frequencies of the three
variable formants, frequency and amplitude of the glottal 
waveform, amplitude of the wide-band noise source for excita- 
tion of the formants, and finally, the central frequency and 
amplitude of the second wide-band noise source, which is 
independent of the formants. This second noise source, most 
commonly employed in the synthesis of fricatives, was not 
needed for the purposes of the present study. Control 
voltages for the remaining six parameters were graphed as a 
function of time (Figure 3), digitized by means of a Hewlett-Packard
F-3B Line Follower, and stored in computer memory
(see snl, £1969, sand -Akitty..1970;, for further detail of PAT
programming in conjunction with the PDP system). A general
overview of various speech synthesizers is also available in
Cooper (1961).

Figure 3: PAT control functions for the synthesis of two
labial stimuli and two dental stimuli.

Two sets of stimuli, a labial series and a dental
series, were prepared, each containing 14 VOT variants.
The appropriate (Delattre, et al., 1955) formant loci and
transitions were incorporated to identify initial consonants
as either labial or dental. All of the 500 msec CV
syllables ended with a three-formant approximation to the
vowel [a], with a fundamental frequency of 110 Hz, dropping
toward the end.


With duration of transition set at 40 msec, the 
three vowel formants for the labial series in ascending order, 
started at approximately 370 Hz, 1000 Hz, and 2500 Hz, rising
to 760 Hz, 1280 Hz, and 2650 Hz, respectively. Formants
for the dental series, in ascending order, started at
approximately 370 Hz, 1600 Hz, and 2800 Hz, with F1 rising
to approximately 760 Hz, while F2 and F3 fell to 1280 Hz
and 2650 Hz, respectively. Variations in VOT were achieved
by shifting the onset of the glottal waveform with relation 
to the release burst. VOT variants for the two series were 
the same as those used by Eimas and Corbit; the range of VOT 
for the labial set was from -10 msec to +60 msec and for the 
dentals, from 0 to +80 msec. In both series, VOT varied in 
5 msec steps, up to +50 msec, after which VOT increased in
10 msec steps.
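
The step rule just described can be checked with a short sketch. The function name `vot_series` and its parameters are illustrative assumptions rather than part of the original synthesis programs; the ranges and step sizes are those stated in the text.

```python
def vot_series(start_ms, end_ms, fine_step=5, coarse_step=10, switch_at=50):
    """Return VOT values (msec): fine steps up to switch_at, coarse steps after.

    Hypothetical helper; only the ranges and step sizes come from the text.
    """
    values = list(range(start_ms, switch_at + 1, fine_step))
    values += list(range(switch_at + coarse_step, end_ms + 1, coarse_step))
    return values


labial = vot_series(-10, 60)  # -10 to +50 msec in 5 msec steps, then +60
dental = vot_series(0, 80)    # 0 to +50 msec in 5 msec steps, then +60, +70, +80

print(len(labial), labial)
print(len(dental), dental)
```

Under this reading each series contains 14 variants, which agrees with the 28 stimuli referred to in the Recordings section.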
Recordings 


Ten different randomizations of the entire set of 


28 labial and dental stimuli were divided into two equal
Blocks of five different randomizations (140 presentations)
each. The stimuli, separated by two seconds of silence,
were then recorded on magnetic tape, in analog form, from
PAT output, using a TEAC A-7030 tape-deck. The final recording
consisted of five complete Blocks of stimuli
(Block order 1-2-1-2-1), or 25 presentations of each stimulus.


During the experiment, the recorded stimuli were 
played back through the left channel of the tape-deck. 
Peak rms syllable value was held constant at optimum play- 
back level, as monitored by the left channel VU-meter. As 
a source of the white-noise masking signal a GRC (General 
Radio Company) 1382 Random Noise Generator was used. Precise 
variations in signal-to-noise ratio were achieved with a 
Hewlett-Packard 350D Attenuator Set. The white noise out- 
put of the noise source passed through the attenuator, and 
finally, into the right channel of the tape-deck, where 
rms value could be monitored (see Figure 4). By regulating 
the white noise output level, signal and noise were balanced.
For S/N=0, peak rms syllable value and rms white
noise value were equal. Holding the signal constant, it
was then possible to vary the attenuation of the white noise
in 10 dB steps, from zero attenuation (S/N=-20 dB) to 40 dB
attenuation (S/N=+20 dB, white noise barely audible).
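
The relation between attenuator setting and S/N ratio can be sketched as follows. This is a minimal illustration, assuming that the decibel values refer to rms amplitude (so that a 20 dB change corresponds to a factor of ten); the names `sn_ratio_db` and `rms_ratio` are hypothetical.

```python
UNATTENUATED_SN_DB = -20  # S/N with zero attenuation of the noise, as stated in the text


def sn_ratio_db(attenuation_db):
    """S/N ratio (dB) for a given attenuation of the white noise (dB)."""
    return UNATTENUATED_SN_DB + attenuation_db


def rms_ratio(sn_db):
    """Signal-to-noise rms amplitude ratio, assuming amplitude decibels."""
    return 10 ** (sn_db / 20)


# Attenuation varied in 10 dB steps from 0 to 40 dB.
conditions = {att: sn_ratio_db(att) for att in range(0, 41, 10)}

for att, sn in conditions.items():
    print(f"attenuation {att:2d} dB -> S/N {sn:+d} dB, rms ratio {rms_ratio(sn):.2f}")
```

On this assumption the five masking conditions are -20, -10, 0, +10 and +20 dB, with signal and noise rms values equal at 20 dB of attenuation.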
Presentation 


The two separate output channels from the tape-deck
were mixed, using a Braun CSV 250 Power Amplifier, and the
resulting monophonic output was presented to subjects
binaurally, at a comfortable listening level, using Telephonics
TDH-49 earphones, with MX41/AR cushions. The level of white
noise was changed after each Block presentation, in such a
way as to prevent any two consecutive Blocks from having the
same level of background noise. On the basis of S/N ratio,
subjects were divided into two equal groups of five as
indicated in Table 2. The order of white noise variation
for Group 2 subjects was basically a reversal of that presented
to Group 1 subjects. Group 1 and Group 2 subjects
were tested separately, each normally in three separate
one-hour sessions, of five, ten, and ten Blocks, respectively;
the first session included a set of pre-recorded instructions
(see Appendix A). Sessions were separated by at least 24
hours.

Figure 4: Schematic diagram of experimental apparatus.


In what was essentially a 4AFC procedure, subjects 
were required to identify each stimulus as belonging to one 
of the four possible response categories, [p], [b], [t], or
[d], and to record each response, according to presentation
number and category, on a prepared IBM answer sheet. Responses
were tabulated by an IBM optical scoring machine. The final
set of raw data for each subject included 25 identifications
of each of the 28 stimuli under each of the five signal-to-noise
conditions (25 x 28 x 5 = 3500 responses for each
subject). The basic experimental design for stimulus presentation
is represented in Table 3.
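
The presentation counts quoted in this chapter can be cross-checked with a few lines of arithmetic. The constant names are illustrative, and the figure of five Blocks per S/N condition is an inference from the stated totals rather than something the text gives directly.

```python
N_STIMULI = 28                # 14 labial + 14 dental VOT variants
RANDOMIZATIONS_PER_BLOCK = 5  # each Block holds five randomizations of the full set
BLOCKS_PER_CONDITION = 5      # inferred: 25 identifications / 5 per Block
SN_CONDITIONS = 5             # -20, -10, 0, +10, +20 dB
SESSION_BLOCKS = (5, 10, 10)  # Blocks heard in the three sessions

presentations_per_block = N_STIMULI * RANDOMIZATIONS_PER_BLOCK
repeats_per_stimulus = RANDOMIZATIONS_PER_BLOCK * BLOCKS_PER_CONDITION
responses_per_subject = repeats_per_stimulus * N_STIMULI * SN_CONDITIONS
total_blocks = sum(SESSION_BLOCKS)

print(presentations_per_block, repeats_per_stimulus, responses_per_subject, total_blocks)
```

The two routes to the total agree: 25 Blocks of 140 presentations give the same 3500 responses as 25 x 28 x 5.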

TABLE 2

DISTRIBUTION OF S/N CONDITIONS BY SESSION
TO TWO GROUPS OF FIVE SUBJECTS

(Rows: Session. Columns: Block Number.)

TABLE 3

BASIC DESIGN FOR ELICITATION OF PSYCHOMETRIC
FUNCTIONS FOR A SINGLE SUBJECT

(Factors shown: stimulus series and masking condition.)

Subjects


Subjects were ten native English-speaking female 
student volunteers from the University of Alberta, whose 
ages ranged from 18 to 36 years. None of the subjects had 
had any experience with synthetic speech. All were sub- 
jected to standard Pure Tone Audiometric (sweep) Tests, 
with the use of a Beltone 10-D Audiometer. The mean results 
(both ears) for each subject are presented in Appendix B. 
With only one exception, subject HA, age 33, who showed a 
decrease in right ear sensitivity to frequencies above 
6 kHz, all were found to be normally sensitive to frequencies 


in the speech range, with threshold criterion set at 15 dB. 




CHAPTER III
RESULTS 


Data Preparation 


Previous experimenters have generally applied a Two 
Alternative Forced-Choice (2AFC) method to elicit responses.
For a given place of articulation, say, labial or dental, a
particular VOT variant was to be identified as either voiced 
or voiceless. Cumulative frequency distributions of cate- 
gory identifications by VOT were readily constructed, and 
the 50% cross-over point was taken as an overall index of 


the subject's performance under a given set of conditions. 
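
As an illustration of the 50% cross-over index, the sketch below interpolates linearly between the two VOT values that straddle 50% 'voiced' responses. The identification percentages are invented for the example, and linear interpolation is an assumption; the earlier studies may have located the cross-over differently.

```python
def crossover_50(vot_ms, pct_voiced):
    """Linearly interpolate the VOT at which 'voiced' responses fall to 50%.

    Returns None if the function never crosses 50%.
    """
    points = list(zip(vot_ms, pct_voiced))
    for (v0, p0), (v1, p1) in zip(points, points[1:]):
        if p0 >= 50.0 >= p1 and p0 != p1:
            return v0 + (p0 - 50.0) * (v1 - v0) / (p0 - p1)
    return None


# Hypothetical 2AFC identification function (percent 'voiced' by VOT in msec).
vot = [0, 10, 20, 30, 40, 50]
voiced = [100.0, 96.0, 80.0, 20.0, 4.0, 0.0]

boundary = crossover_50(vot, voiced)
print(boundary)
```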


Such a direct index was not readily available in 
the present experiment, which employed a Four Alternative 
Forced-Choice (4AFC) paradigm. Labial and dental stimuli 
were mixed, and presented at random, and subjects were 
required not only to implicitly identify stimuli as either 
voiced or voiceless, but also to categorize stimuli according
to place of articulation (labial or dental). As a
result of subsequent confusions in place of articulation,
the cumulative frequency distributions of voiced and voiceless
responses, based on total presentations for a given
place of articulation, were asymmetrical. Thus, it became
necessary to employ an alternate index of the subject's
response.


It was believed that stimuli incorrectly classified 
with respect to place of articulation (place errors), regardless
of manner identification, could be treated as
"errors". Subsequently, they were dealt with independently
of the remaining "correct" responses. Discussion of errors
appears in Chapter IV.


Dealing only with "correct" responses, it was ap- 
parent that the two cumulative frequency distributions of 
voiced and voiceless identifications would have to be 
represented by separate VOT values. That is, a single index 
was needed to represent each of the separate voiced and 
voiceless response distributions elicited for each of the 
two stimulus series (labial and dental). Given that the 
distributions under consideration in the present study 
are of an asymmetrical nature, the mean would present a 
biased estimate of central tendency. The asymmetry was 
sufficient to suggest the distribution mid-point as the
preferred statistic to represent central tendency in the 
analysis of "correct" responses. The resulting VOT index 
values for voiced and voiceless identification distributions 
(Appendix C) were used as basic data points in the experi- 
mental design for the analysis of variance presented in 


Table 4. 
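
The preference for the distribution mid-point over the mean can be illustrated with a small sketch. The response counts are invented, and the particular median rule used here (the lowest VOT at which the cumulative count reaches half the total) is an assumption; the thesis does not specify how the mid-point was computed.

```python
def median_vot(vot_ms, counts):
    """Median of a response distribution given as counts per VOT value."""
    total = sum(counts)
    if total == 0:
        return None
    cumulative = 0
    for v, c in zip(vot_ms, counts):
        cumulative += c
        if 2 * cumulative >= total:
            return v
    return None


def mean_vot(vot_ms, counts):
    """Count-weighted mean VOT of the same distribution."""
    return sum(v * c for v, c in zip(vot_ms, counts)) / sum(counts)


# Hypothetical 'voiced' response counts with a long tail toward high VOT.
vot = [0, 5, 10, 15, 20, 25, 30, 40, 50]
voiced_counts = [25, 25, 20, 5, 2, 1, 1, 1, 1]

print(median_vot(vot, voiced_counts), round(mean_vot(vot, voiced_counts), 2))
```

With these counts the mean is pulled toward the tail while the median stays with the bulk of the responses, which is the kind of bias the text attributes to the mean.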


Analysis of Variance 


The analysis of variance can be described as a
2 (place of articulation) by 2 (manner of
articulation) by 5 (masking level) factorial design with
repeated measures on all factors, for ten subjects. Each
of the 20 cells in the present design (Table 4) contained
the corresponding VOT index values computed for each of
the ten subjects. The results of the analysis of variance
are presented in Table 5.

TABLE 4

EXPERIMENTAL DESIGN FOR ANALYSIS OF VARIANCE
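
The degrees of freedom implied by this design, and the way each F-ratio follows from the sums of squares, can be sketched as below. The sketch assumes a single pooled error term, following the one Error line in Table 5, and uses only the Place, A x B and Error sums of squares as read from that table; the helper `f_ratio` is hypothetical.

```python
from itertools import combinations
from math import prod

levels = {"A": 2, "B": 2, "C": 5}  # place, manner, masking level
n_subjects = 10

# Degrees of freedom for every main effect and interaction.
df = {}
for k in range(1, len(levels) + 1):
    for combo in combinations(levels, k):
        df[" x ".join(combo)] = prod(levels[f] - 1 for f in combo)

n_cells = prod(levels.values())
df_total = n_cells * n_subjects - 1
df_subjects = n_subjects - 1
# Assumption: whatever is not an effect or subjects is pooled into one error term.
df_error = df_total - sum(df.values()) - df_subjects


def f_ratio(ss_effect, df_effect, ss_error, df_err):
    """F = mean square of the effect divided by the error mean square."""
    return (ss_effect / df_effect) / (ss_error / df_err)


f_place = f_ratio(48.585, df["A"], 545.946, df_error)
f_place_by_manner = f_ratio(1.086, df["A x B"], 545.946, df_error)

print(df, df_error)
print(round(f_place, 2), round(f_place_by_manner, 3))
```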


Place by Masking Interaction


Although significant overall F-ratios were observed 
for both the Place and Masking factors, the interaction 
between them was also significant (F=3.798, p<0.01). Thus, 
the two factors cannot be discussed independently. The 
interaction is represented by Figure 5, in which VOT 
(ordinate) is plotted as a function of masking level 


(abscissa) for the two places of articulation.


Figure 5 indicates that, as a result of masking, 
VOT values for the dental series varied over a broader 
range than did VOT values for the labial series. An a
posteriori Newman-Keuls test (Ferguson, p. 274) for significant
differences among means was performed on VOT's for
the labial and dental series separately. The results of
the Newman-Keuls tests are presented in Tables 6(a) and
6(b).


Table 6(a) would seem to indicate that VOT's for 
the labial series did not undergo any significant change


as a function of masking, while Table 6(b) reveals that at 

TABLE 5

ANALYSIS OF VARIANCE

MEDIAN VOT VALUE AS A FUNCTION OF
PLACE OF ARTICULATION, MANNER OF ARTICULATION
AND MASKING LEVEL

SOURCE                          SS          df     MS          F

Place: Labial/Dental (A)        48.585      1      48.585      15.22***
Manner: Voiced/Voiceless (B)    23356.87    1      23356.87    7315.65***
Masking Level (C)               179.049     4      44.762      14.001***
A x B                           1.086       1      1.086       0.340
A x C                           48.503      4      12.126      3.798**
B x C                           16.708      4      4.177       1.303
A x B x C                       13.970      4      3.492       1.093
Subjects R(ABC)                 623.784     9      69.309      21.707***
Error                           545.946     171    3.193



Figure 5: VOT as a function of S/N ratio.


TABLE 6

NEWMAN-KEULS TESTS FOR SIGNIFICANT DIFFERENCES AMONG
MEAN VOT VALUES BY S/N RATIO

(a) LABIAL SERIES

Differences Among Means (mean VOT's in order of magnitude)

(b) DENTAL SERIES

Differences Among Means

S/N levels of 0 dB, -10 dB, and -20 dB, VOT's for the
dental series were, in fact, significantly different. This
would suggest that even though the labial and dental profiles
in Figure 5 are in some respects visually similar, the
effects of masking may only be realized in the case of S/N
ratios below 0 dB for the dental series alone.


Why the dental series should be more sensitive in 
this respect to white noise masking is not a question which 
can be simply answered. This question will be discussed 
in Chapter IV. It will suffice at this point to state that
VOT values for the labial series remained much more stable 
throughout the observed range of white noise masking than 


did those for the dental series. 
Manner of Articulation 


The exceptionally high F-ratio for manner categories 
(F = 7315.65; p<0.001) was not altogether unexpected, and 
reflects the fact that this factor was by far the largest 
source of variation in the present analysis. VOT values
for each of the four response distributions are plotted as
a function of S/N ratio in Figure 6. Directly observable
in Figure 6 is the fact that for either place of articulation
the large difference (in msec) between VOT values for
'voiced' and 'voiceless' response distributions remained
quite stable throughout the entire range of masking conditions.
On the basis of the consistently separate
VOT values, it may be argued that white noise masking had
no effect on the ability of subjects to label stimuli as
voiced or voiceless in a discrete fashion, on the basis of
VOT.

Figure 6: VOT's for each of the four response
categories as a function of S/N ratio.

Subjects 


Differences among subjects were found to be significant
(F=21.707; p<0.001). Although the specific nature
of those differences is not directly considered in the 
present study, the significant F-ratio associated with inter- 
subject differences indicates that one must approach the 

CHAPTER IV
DISCUSSION 


Although various interpretations can be offered to 
account for the results of the present study, above all, it 
is apparent that VOT is, in fact, an effective signal characteristic
in the identification of voiced and voiceless
initial stop consonants in English. However, in light of the
interaction between the effects of place of articulation and 
masking on VOT values, complemented by the results of the 
a posteriori Newman-Keuls tests, it must be acknowledged 
that certain aspects of the dental stimuli rendered them more 
vulnerable than the labial stimuli to the effects of white 


noise masking. 


At high levels of masking, the effect of white noise
on frequency domain properties of the stimuli was sufficient
to cause a significant shift in the VOT values for the
dental response distributions, while VOT values for the
labial response distributions remained stable. This leads to
the assumption that in addition to VOT, some concurrent frequency
domain stimulus features may play an important role in
the identification of voiced and voiceless initial stops,
even though those same features might vary according to place
of articulation.


Further motivation for the consideration of alternate 
stimulus features stems from the results of spectrographic 
analysis of the stimuli. The spectrograms in Figure 7
suggest that at S/N ratios of -10 dB and -20 dB, the 

Figure 7(a): Sound spectrograms of a labial (left) and a dental (right)
stimulus, at 0 msec VOT, under three S/N conditions (time along abscissa).

Figure 7(b): Sound spectrograms of a labial (left) and a dental (right)
stimulus, at 50 msec VOT, under three S/N conditions (time along abscissa).

initial noise burst signalling consonant release may have
been completely masked, rendering VOT virtually undetectable.
Under such circumstances, it follows that the voiced/voiceless
distinction may have been sustained entirely on the
basis of features other than VOT.


By removing the initial noise burst from the present 
stimuli, and examining subsequent stimulus identifications, 
these contingencies could be more directly examined. Within 
the limits of the present study, however, one possibly 
beneficial approach to the relative perceptual importance 
of additional stimulus features might be provided by an 


examination of the confusions of place of articulation.

The Confusion of Place Categories 


Figure 8 represents the number and percent of confusions
of place of articulation under each masking
condition, for each of the VOT variants, from 0 msec to 60
msec (labial and dental series combined). This range covers
all VOT values which overlapped both place series. As one
would expect, place-errors were most pronounced at the
highest level of background noise. Moreover, there appears
to be a certain S/N ratio (0 dB) above which errors are
negligible (maximum error 2%), and below which errors
increase radically. The increase in errors is not equally
spread over the entire range of VOT; at higher masking levels,
relatively more errors resulted in the range of VOT above
25 msec.

The increase in error as a function of VOT is
plotted in Figure 9. The frequency polygons in Figure 9(a)
represent the number of times each stimulus from the dental
series was identified as a labial, while the reverse holds
for Figure 9(b). In both cases, place-errors are subdivided
according to manner category of response (voiced
or voiceless). Once again, it is evident that dental and
labial stimuli are treated differently by subjects. With
respect to place of articulation, dental stimuli were more
often mis-classified than labial stimuli.


In light of the above observations, it would not 
appear unwarranted to assume that as the masking level 
increased, so did the loss of acoustic information necessary 
to determine the place category of a particular stimulus, 
and furthermore that the loss was more substantial for 
dental stimuli than for labial stimuli. However, it is
evident that at least minimal information about manner of
articulation remained at even the highest masking level, in
order to enable subjects to identify stimuli from the high
end of the VOT scale as 'voiceless', and those from the low
end as 'voiced'. This observation, in conjunction with the
analysis of variance results, supports earlier claims that
certain cues for place and manner of articulation are
independent.


The relationship between S/N ratio and 'transmitted
information' (in an information-theoretical sense) was
discussed by Miller and Nicely (1955, p. 348), and is
generally supportive of this interpretation. They calculated
that 'voicing information' can be transmitted at S/N
levels 18 dB below those needed for the transmission of
'place information'. The problem remains, however, of
accounting for the error patterns as they relate to VOT.


Interplay Between S/N Ratio and VOT as a Possible Cause of
Place Errors


Numerous studies have reported experimental evi- 
dence to suggest that for both natural and synthetic speech 
stimuli, frequency characteristics of the release burst 
and adjacent vowel formant (F2) transitions are essential 
in the identification of stop consonants. The reader is 
directed to Miller and Nicely (1955), Delattre, Liberman, 
and Cooper (1955), Halle, Hughes, and Radley (1956), 
Liberman et al. (1957, 1958), and Kozhevnikov and Chistovich
(1965) for further discussion.


For the present stimuli, these formant transitions 
can be precisely measured. In this case, all of the formant 
specifications of a given stimulus series were held constant 
(see Figure 3). Systematic variations in VOT were achieved
by simply shifting the onset of the glottal waveform along
the time scale, with respect to a fixed point (t=0) corresponding
with consonant release. In other words, the temporal


was identical for all stimuli. 

As noted in Chapter II, the overall time allotted
for formant transition was fixed at 40 msec. Labial and
dental stimuli differed only in the direction and initial
frequencies of the second and third formants. The 30 msec
release noise burst was present in all stimuli as a source
of initial short-duration formant excitation. Thus, for
VOT's in the range 0 msec to 35 msec, all or some of the
adjacent vowel formant transition was modulated by the
glottal waveform, in conjunction with the initial 30 msec
noise burst. For VOT's of 40 msec or more, the only source
of formant excitation during the transition period was the
30 msec noise burst, since the onset of the glottal waveform
did not occur until after the formant transitions had
been completed, and the formants had reached appropriate
steady-state values for the following vowel [a]. From this
assessment, one would expect that using a white noise agent,
it would be easier to mask the formant transitions in
stimuli which exhibit long VOT's than in stimuli which exhibit
short VOT's. Strong support for this claim resides
in the observation that, for both stimulus series, VOT is
highly correlated with place errors: for the labial series,
r = 0.952 (p<0.01), and for the dental series,
r = 0.970 (p<0.01).


On the other hand, this evidence could as easily be
cited in support of the claim that another feature, namely,
the duration of periodically excited formant transitions
(dt), might also be offered to account for the same error
phenomena. From the perfect inverse relationship between
VOT and dt in the range of VOT from 0 msec to 40 msec, it
follows that dt is negatively correlated with errors, to 
exactly the same level of significance as is VOT. However, 
if the range is extended to include all VOT's from 0 msec 
to 60 msec, the correlations are no longer identical. In 
correlating errors with VOT in the labial series, r = 0.9119,
and, in the case of errors and dt, r = -0.9572. Similarly,
for the error by VOT correlation in the dental series,
r = 0.8351, while r = -0.9635 in the correlation between
errors and dt. The purpose in pointing out these correlations
is to suggest that over an extended range of VOT,
place errors might be more closely associated with the
duration of periodically-excited vowel formants.
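The complementary relationship between VOT and dt, and its effect on the two correlations, can be sketched as follows. The 40 msec transition duration is taken from the stimulus description; the 5 msec VOT step size and the error counts are assumptions made purely for illustration and are not the experimental data.

```python
import math

TRANSITION_MSEC = 40  # formant transition duration stated for the present stimuli

def periodic_transition_duration(vot_msec):
    """Milliseconds of formant transition excited by the glottal waveform (dt)."""
    return max(0, TRANSITION_MSEC - vot_msec)

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

vots = list(range(0, 65, 5))  # 0..60 msec in assumed 5 msec steps
# Hypothetical place-error counts that level off once dt reaches zero.
errors = [2, 3, 5, 8, 12, 17, 23, 30, 38, 39, 38, 40, 39]

for upper in (40, 60):
    v = [x for x in vots if x <= upper]
    e = errors[:len(v)]
    d = [periodic_transition_duration(x) for x in v]
    # Columns: upper VOT limit, r(errors, VOT), r(errors, dt)
    print(upper, round(pearson_r(e, v), 4), round(pearson_r(e, d), 4))
```

Up to 40 msec, dt is simply 40 minus VOT, so the two coefficients are equal in magnitude and opposite in sign. Beyond 40 msec, dt stays at zero while VOT keeps increasing, so the coefficients diverge whenever the error counts level off.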


This relationship might be clearer in Table 7, in
which a more precise account of a number of correspondent
changes in the structure of the F2 transitions for the
labial and dental stimulus series is presented. From
Table 7, it is noted that for VOT's of 40 msec or more, all
stimuli are essentially the same, except for information
conveyed by the 30 msec release noise burst. This might
account for what appears in Figure 8 (S/N = -20 dB) to be
the complete loss of ability to distinguish between labial
and dental stimuli at over 40 msec VOT.
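The three quantities tabulated against VOT (F2 frequency at the onset of the glottal waveform, the remaining interval to the vowel target, and the milliseconds of voiced transition) can be sketched under a simplifying assumption. The sketch assumes that F2 moves linearly from its initial value to the 1280 Hz target for [a] over the 40 msec transition. The linear trajectory and the two initial frequencies are hypothetical, not the synthesizer values used for the present stimuli.

```python
TARGET_F2_HZ = 1280      # steady-state F2 for the vowel [a]
TRANSITION_MSEC = 40     # fixed formant transition duration

def f2_columns(initial_f2_hz, vot_msec):
    """Return (F2 at voicing onset, interval to target, msec of voiced transition).

    Assumes F2 moves linearly from initial_f2_hz to TARGET_F2_HZ.
    """
    elapsed = min(vot_msec, TRANSITION_MSEC)
    fraction = elapsed / TRANSITION_MSEC
    f2_at_onset = initial_f2_hz + (TARGET_F2_HZ - initial_f2_hz) * fraction
    interval = abs(TARGET_F2_HZ - f2_at_onset)
    voiced_msec = TRANSITION_MSEC - elapsed
    return f2_at_onset, interval, voiced_msec

# Hypothetical onsets: a rising (labial-like) and a falling (dental-like) F2.
for label, start in (("rising", 1000), ("falling", 1600)):
    for vot in (0, 20, 40, 60):
        print(label, vot, f2_columns(start, vot))
```

With these assumptions, every stimulus with a VOT of 40 msec or more yields the same triple (target frequency, zero interval, zero voiced transition), which is the sense in which such stimuli differ only in the release burst.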


TABLE 7

CHANGES IN F2 AS A FUNCTION OF VOT
(Errors included for reference)

[Tabulated values for the LABIAL SERIES and the DENTAL SERIES are not legible in this copy.]

Notes:
frequency of F2 at onset of glottal waveform.
interval (Hz) between initial and target frequency for
the vowel [a] (1280 Hz).
milliseconds of formant transition modulated by glottal
waveform.


Presumably, the same parameters could be specified
for stimuli used in earlier VOT studies, if a more precise
description of those stimuli were available. The possible
perceptual importance of these stimulus features has been
given very little attention by previous authors concerned
with VOT. Those authors carried out their experiments
under the assumption that VOT was the only property of the
stimulus which was systematically manipulated, while in
reality, the manipulation of VOT can be shown to result in
a number of related changes in other stimulus features. On
the basis of some of these features, it may be possible to
re-evaluate the stimuli.


In general, periodic signals have been shown to be
less vulnerable to the effects of white noise masking than
aperiodic signals (Hirsh, 1952). With respect to 'place
information' conveyed by the release burst and adjacent
formant transitions, stimuli exhibiting short VOT's can be
said to represent the former type of signal, while stimuli
with long VOT's resemble the latter. It is therefore likely
that in order to cause a significant amount of place confusion,
stimuli exhibiting a distinctive amount of periodic
excitation of formant transitions (short VOT's) should
require a higher level of random masking noise than stimuli
which exhibit little or no periodic excitation of formants
(long VOT's). It is important to note that this hypothesis
contradicts the claims made by Fletcher (1929) to the effect
that voiceless (tense) consonants are inherently more audible
than voiced (lax) consonants.




The differences in error patterns might also be
approached in terms of differential sensitivity to masking
effects. Fischer-Jorgensen (1954) pointed out certain
acoustic differences between [p] and [t] in natural speech.
She described [t] as having higher inherent resonances than
[p], and before a central vowel, [t] was indicated by a
steep falling F2 transition. Because of the relative
weakness of the F2 transitions associated with longer VOT's,
one might expect the transition to be quite easily masked,
leaving only the stronger, low-frequency energy associated
with [p]. In such a case, to expect a [p] identification
would not be unreasonable, since, as Fischer-Jorgensen
stated, ". . . if there is no positive reason for hearing


[k] or [t], there will be a majority of [p] (p. 50)."


Within the present framework, the relationships 
which hold between confusions and VOT may be reconsidered, 
and possibly explained, in terms of the differential 
sensitivity of the present stimuli to white noise masking. 
This approach toward more complete stimulus description may 
prove beneficial in the explanation of observed VOT
discriminability as well as confusions of place of articulation.


Formant Transitions as a Plausible Cue 


for Stimulus Discrimination 


Liberman, Harris, Hoffman, and Griffith (1957)
asserted that the listener can only discriminate between
speech stimuli which he can identify as belonging to
different phonemic categories. Based on identification
functions, the phonetic boundary between two categories was
established by the point of subjective equality (PSE),
that point on the stimulus continuum at which a stimulus
was identified as belonging to either category with equal
probability.
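As an illustration of how a point of subjective equality might be located on an identification function, the following sketch interpolates linearly between the two stimulus values that bracket the 50 percent point. The interpolation method, the function name, and the identification proportions are assumptions for illustration; they are not taken from the studies cited.

```python
def point_of_subjective_equality(vots, prop_voiceless):
    """Interpolate the VOT at which 'voiceless' responses reach 50 percent.

    vots must be increasing; prop_voiceless[i] is the proportion of
    'voiceless' identifications at vots[i]. Returns None if no crossing.
    """
    pairs = list(zip(vots, prop_voiceless))
    for (v0, p0), (v1, p1) in zip(pairs, pairs[1:]):
        if p0 < 0.5 <= p1:
            # Linear interpolation between the two bracketing stimuli.
            return v0 + (0.5 - p0) * (v1 - v0) / (p1 - p0)
    return None

# Hypothetical identification function (not data from any cited study).
vots = [0, 10, 20, 30, 40, 50, 60]
prop_voiceless = [0.0, 0.02, 0.10, 0.40, 0.90, 0.98, 1.0]
print(point_of_subjective_equality(vots, prop_voiceless))
```

For these hypothetical proportions the crossing falls between the 30 msec and 40 msec stimuli, at about 32 msec.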


Lisker and Abramson (1970), Lazarus and Pisoni 
(1972), and Eimas and Corbit (1973) generally agreed that 
with respect to the VOT continuum, the phonetic boundary 
between the English voiced and voiceless categories was 


found in the region of 30 msec VOT. 


Functions of discriminability along the VOT 
dimension have been reported by Abramson and Lisker (1970), 
Lazarus and Pisoni (1972), and Eimas and Corbit (1973).
Their results were obtained by ABX triadic comparison procedures,
in which the third stimulus in a series was to be
judged as identical to either the first or the second. The
authors reported a peak in discriminability at about 30 msec
VOT. Performance at chance level was observed at about 15
msec in either direction from the peak. This consistency
is not surprising in light of the fact that all three
studies employed the same (Lisker & Abramson, 1970) synthetic
stimuli.
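The scoring of an ABX procedure can be sketched briefly. In each trial the listener judges whether the third stimulus (X) matches the first (A) or the second (B), so that chance performance is 50 percent. The trial format, the function name, and the responses below are hypothetical and do not reproduce the procedures or data of the studies cited.

```python
from collections import defaultdict

CHANCE = 0.5  # expected proportion correct when A and B cannot be told apart

def abx_proportion_correct(trials):
    """Return {(lower_vot, higher_vot): proportion correct} for ABX trials.

    Each trial is (vot_a, vot_b, x_is, response), where x_is and response
    are 'A' or 'B'. Pairs are pooled regardless of presentation order.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for vot_a, vot_b, x_is, response in trials:
        key = tuple(sorted((vot_a, vot_b)))
        total[key] += 1
        correct[key] += (response == x_is)
    return {key: correct[key] / total[key] for key in total}

# Hypothetical trials: (VOT of A, VOT of B, identity of X, listener response).
trials = [
    (20, 40, "A", "A"), (20, 40, "B", "B"), (40, 20, "A", "A"), (20, 40, "B", "B"),
    (0, 20, "A", "B"), (0, 20, "B", "B"), (20, 0, "A", "A"), (0, 20, "B", "A"),
]
for pair, score in sorted(abx_proportion_correct(trials).items()):
    print(pair, score, "above chance" if score > CHANCE else "at or below chance")
```

Plotting such proportions against the VOT of each pair gives a discrimination function of the kind described, with peaks wherever a pair is scored well above chance.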


Optimum VOT discriminability in the region of the
phonetic boundary was offered by the above authors as
evidence to support their claim that the listener
discriminates between stimuli on the perceptual basis of
phoneme class membership. The only discussion of formant
transitions associated with the stimuli used appears in
Lazarus and Pisoni (1972, p. 2). Although very little
information was presented about specific frequency
parameters, it was stated that the duration of formant
transitions was fixed at approximately 50 msec. An
essentially complementary relationship has been established
between VOT and the duration of periodic excitation of
vowel formant transitions. Thus, earlier reports of optimum
discriminability at just below 30 msec VOT may be restated
as optimum discriminability for periodically-excited formant
transitions of just over 20 msec in duration.


Under this interpretation, it is possible that peaks 
in observed discriminability along the VOT dimension may be 
due to liminal acoustic properties of associated formant
transitions, rather than to abstract linguistic parameters.
Such a proposal is by no means unwarranted. On the contrary,
it would seem premature to assume the necessity of highly
specialized mechanisms for the acoustic analysis of a
particular type of speech signal without having demonstrated
first, that the signal cannot be regarded as a complex of
more fundamental acoustic variables, and second, that general
auditory mechanisms are incapable of the same basic analysis.


Nabelek and Hirsh (1969) examined the ability of
listeners to discriminate among various frequency transitions.
They computed relative difference limens (DL's) for
frequency transitions varying in duration, frequency shift,
and general frequency region. Three frequency regions were
examined: 250 Hz, which the authors felt was related to
pitch and intonation, and 1 kHz and 4 kHz, which they
regarded as important regions in the perception of formant
transitions. In general terms, they reasoned that the
listener relies heavily on "glide rate" (frequency shift
per unit time) in the discrimination. In addition, their
results indicated that, with the exception of small frequency
shifts in the pitch region, greatest discriminability
(smallest DL's) was obtained for transition durations of from
20 to 30 msec. Thus, since these measures agree well in
magnitude with the range of durations of frequency transitions
in normal speech, especially in the F2 - F3 regions,
Nabelek and Hirsh concluded that the observation of peaks
in discriminability reflects a ". . . general property of
hearing and that it does not appear only in connection with
speech sounds (p. 1518)."


Further general agreement with the re-evaluated
results of VOT studies suggests that certain important
acoustic aspects of VOT variants might also be analyzed
by auditory mechanisms which are not specific to speech
analysis. With respect to the "linguistic feature detector"
model proposed by Eimas and Corbit (1973), it may be concluded
that no such complex speech-specific mechanism
(i.e., a phonetic-level processor) need be postulated, and
that various more general auditory models can be offered
to account for the same phenomena.


The Need for Further Research 


One point which is obvious is that a great deal of
further research is necessary in areas related to VOT.
Experiments should be designed for the examination of the
nature of inter-subject differences. Although it is
believed that the present data were elicited under a broader 
range of experimental conditions than data elicited in 
earlier studies, the experimental design itself leaves 
many important research questions out of reach. For ex- 
ample, in a number of different ways, the present results 
suggest that place of articulation is a very important 
factor in the consideration of VOT. Surmounting past meth- 
odological limitations, VOT studies should be expanded to 
include the perception of initial velar stop consonants 
and possibly the affricate series. Furthermore, the pre- 
sent shift of attention to such stimulus features as 
formant transitions necessitates the inspection of VOT and 
consonant perception in relation to a broader range of 


following vowels. 


The apparent co-operation of VOT with other acoustic 
properties of the stimulus has given rise to several 
possible re-interpretations of experimental evidence which 
has been provided by earlier VOT studies. In this respect, 


agreement of certain aspects of the present results with
those of previous studies, dealing with both speech and
non-speech stimuli, provides strong motivation for future
postulations of more general auditory models.


CHAPTER V 
SUMMARY 


Voice onset time (VOT), defined as the time interval 
between consonant release and the onset of laryngeal pulsing, 
originated as an analytical measure in the determination of
voicing in initial stop consonants. More recently, VOT has
received increased attention as a possible 'cue' for the
listener in the perceptual manner categorization of initial 
stop consonants. Based on the assumed perceptual universality 
of VOT, a "linguistic feature detector" model has been of- 
fered to account for the differentiation between English 


voiced and voiceless initial stops. 


The viability of a language-specialized detector 
system whose sole input is VOT must be contingent upon empirical
evidence of VOT's perceptual universality. A three-factor
experiment with repeated measures was designed to observe
the effects of place of articulation, manner of
articulation, and white noise masking on the identification 


of stimuli in which VOT was systematically varied. 


The voiced-voiceless distinction was apparently maintained
throughout the observed range of masking conditions,
suggesting that VOT is, in fact, an important cue for this
distinction. However, under high masking conditions, median
VOT values for dental response distributions changed significantly,
while those for labial distributions did not. This
result was interpreted as an indication that in addition to
VOT, a certain amount of frequency domain information such
as aspiration and formant transitions may play an important
role in the perceptual distinction between the voiced and
voiceless categories.


High masking levels were also associated with the 
confusion of place of articulation. An attempt to determine 
the cause of such confusions gave rise to a re-evaluation of 
the present synthetic speech stimuli in terms of their 
primary acoustic (as opposed to linguistic) composition. A 
number of other stimulus characteristics, including the F2
transition, were found to vary concurrently with VOT.


When considering the results of stimulus re-evaluation 
in terms of earlier studies concerning both speech and non- 
speech auditory analysis, two important generalizations 
emerged. First, the concept of VOT as a universal perceptual 
cue may be inadequate, since associated theories place little 
perceptual significance on the dynamic acoustic components 
of the stimulus with which VOT is closely related. In turn,
VOT might be regarded as a composite feature, subsuming a 
number of dynamic, perceptually relevant features, of which 
the interval between consonant release and the onset of 
laryngeal pulsing may be only one. Second, the fact that 
certain stimulus features which vary concurrently with VOT 
can be quantified at the acoustic, as opposed to the lin- 
guistic level, prompted the suggestion that VOT might be 
regarded as a primary acoustic feature rather than a primary
linguistic feature of the stimulus.

Previous experimental results suggest that the
auditory analysis and discrimination of frequency transitions 
may be a general auditory process, applicable to both speech 
and non-speech stimuli. The fact that formant transitions 
play an important role in initial consonant perception, and 
are also listed among stimulus features which vary concurrently 
with VOT, suggests that at least some of the properties of 
VOT stimuli can be analyzed, or at least differentiated, on
pre-phonetic grounds.


Such hypotheses present alternatives to the "linguistic
feature detector" model and, for that matter, any other
auditory models requiring highly specialized mechanisms for
the analysis of complex speech stimuli which could possibly
be reduced to more fundamental acoustic components.


APPENDIX A
INSTRUCTIONS TO SUBJECTS


Hi. Thanks for helping me out by being a subject 
in this experiment. What I'm trying to do is to find out 
how English speakers distinguish between certain types of 
initial consonants. The tapes you are about to hear are 
made up of syllables in the form of a consonant followed by 
the vowel a. All I'd like you to do is to identify the
consonant that you hear. I can tell you now that the only
consonants you will hear will be p's, b's, t's, and d's,
followed by a.

In the experiment, pa's, ba's, ta's, and da's will
be mixed together, and presented in an irregular order for
you to identify. Each individual presentation will be
followed by two seconds of silence, in which time you must
identify the consonant you heard by simply filling in the
appropriate location on the answer sheet. Obviously, some
guessing will be necessary in cases where you aren't sure
of the answer, but you have some idea. In these cases, I
want you to guess. You must give a response for every item
presented. In case you honestly miss an answer, simply
leave it blank and go on to the next. We can correct that
later.


Usually, answer sheets like these, which are analyzed
by a computer, require that mistakes be completely erased.
However, in case you should accidentally fill in the wrong
answer slot, time won't allow you to erase and fill another.
So, just lightly cross out the mistake, and fill in what
you think is the correct answer. Afterwards, I'll go back
and erase the crossed-out answer for you. If you have any
questions at this point, please raise your hand.


If you haven't already done so, please fill in the
information for which blank spaces are provided at the top
of the answer sheet. So that I can keep track of your
answers, you will be required to repeat this information at
the top of each answer sheet. Pay no attention to the
space marked S.#; it refers to your 'Subject Number', and
will be assigned later, during the analysis. Below the
heading 'ID Number', please write your Student Identification
Number, if you have one. The ten columns to the right of
the boxes represent the numbers zero through nine, respectively.
Fill in the slot representing the number you have
written in each corresponding row. Please be careful not to 
extend your pencil marks beyond the dashed guidelines; it 
presents problems in the computer analysis. Are there any 


questions? 


What you are about to hear, then, are a few examples 


of the stimuli drawn from the p-b group: 
(Four examples presented here). 


Now listen to a few stimuli chosen from the t-d group:
(Four examples presented here).


Now, see if you can distinguish between pa's and ba's, ta's 


and da's, when they've been mixed together: 


(Minimum and maximum VOT representa- 


tives presented alternately) 


Since you will be making quite a few identifications, 
I've divided the presentations into small sets. After each 
column has been completed, there will be a short rest period.
After we have completed the third page, we'll take a five
or ten minute break. Are there any questions before we
begin?


[Figures: HEARING THRESHOLD LEVEL IN dB (plots not legible in this copy)]


APPENDIX C

DATA POINTS FOR ANALYSIS OF VARIANCE

MEDIAN VOT'S FOR THE LABIAL SERIES

[Table not legible in this copy]


APPENDIX C (continued)

MEDIAN VOT'S FOR THE DENTAL SERIES

[Table not legible in this copy; surviving labels: Voiced, S/N Ratio]